Voice Conversion Using Articulatory Features
Abstract
The aim of voice conversion is to transform an utterance spoken by an arbitrary (source) speaker so that it sounds as if it were spoken by a specific (target) speaker. Text-to-speech (TTS), speech-to-speech translation, mimicry generation, and human-machine interaction systems are among the many applications that can benefit from a voice conversion module. Voice conversion systems generally require parallel data, that is, a set of utterances recorded by both the source and the target speaker. With such data, one can use machine learning techniques (GMMs, ANNs, etc.) to build a frame-level mapping function that transforms the characteristics of the source speaker into those of a specified target speaker. These techniques are considered to perform well because listeners typically perceive the transformed speech as sounding more like the target speaker than the source speaker. However, parallel data is not always available, especially in cross-lingual voice conversion, where the source and target speakers speak different languages. Voice conversion techniques that do not require parallel data have been proposed in the literature, but they still require speech data from the source speaker a priori. They therefore cannot be applied when an arbitrary source speaker wants to transform his or her voice into that of a target speaker without any prior recording. In this dissertation, we propose a method that performs voice conversion without any training data from the source speaker. It removes the need for a priori speech data from the source speaker and can be used for cross-lingual voice conversion. In this method, we capture the speaker-specific characteristics of the target speaker. The problem of capturing these characteristics can be viewed as modelling a noisy channel. The idea behind modelling a noisy channel is as follows.
Suppose that C, a canonical form of a speech signal (a generic, speaker-independent representation of the message in the speech signal), passes through the speech production system of a target speaker to produce a surface form S. This surface form S carries the message as well as the identity of the speaker. One can interpret S as the output of a noisy channel for the input C, where the noisy channel is the speech production system of the target speaker. We used an artificial neural network (ANN) to model the speech production system of the target speaker, which captures the essential characteristics of that speaker. The choice of representation for C and S plays an important role in this method. We used articulatory features (AFs), which represent the characteristics of the speech production process, as the canonical form, since they are assumed to be speaker independent. Our analysis, however, showed that AFs contain a significant amount of speaker information in their trajectories. We therefore propose suitable techniques to normalize the speaker-specific information in the AF trajectories, and the resulting AFs are used for voice conversion. We show that the proposed method can be used to remove the need for source speaker data and to perform cross-lingual voice conversion. Subjective and objective evaluations reveal that the transformed speech produced by the proposed approach is intelligible and possesses the characteristics of the target speaker. A set of transformed utterances corresponding to the results discussed in this work is available for listening at http://researchweb.iiit.ac.in/~bajibabu.b/vc_evaluation.html
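The noisy-channel idea in the abstract can be illustrated with a minimal sketch: a small network is trained only on the target speaker's data to map normalized canonical features C to surface features S, and is then applied to normalized features from an unseen source speaker. Everything concrete here is an assumption made for illustration and is not taken from the dissertation: the arrays are synthetic random data standing in for AFs and acoustic features, per-utterance mean and variance normalization stands in for the unspecified speaker-normalization techniques, and the names `normalize_trajectories` and `TargetSpeakerNet`, the layer sizes, and the learning rate are hypothetical.

```python
import numpy as np


def normalize_trajectories(af):
    """Per-utterance mean/variance normalization of AF trajectories.

    af has shape (frames, dims). This is an assumed stand-in for the
    speaker-normalization techniques mentioned in the abstract.
    """
    mean = af.mean(axis=0, keepdims=True)
    std = af.std(axis=0, keepdims=True) + 1e-8  # avoid division by zero
    return (af - mean) / std


class TargetSpeakerNet:
    """One-hidden-layer network mapping canonical form C to surface form S."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, C):
        # Hidden activations are cached for the backward pass.
        self.H = np.tanh(C @ self.W1 + self.b1)
        return self.H @ self.W2 + self.b2

    def train_step(self, C, S, lr=0.05):
        """One full-batch gradient step; returns the mean squared error
        measured before the update."""
        err = self.forward(C) - S
        n = len(C)
        # Gradients of 0.5 * (squared error summed over dims, averaged over frames).
        grad_W2 = self.H.T @ err / n
        grad_b2 = err.mean(axis=0)
        d_hidden = (err @ self.W2.T) * (1.0 - self.H ** 2)
        grad_W1 = C.T @ d_hidden / n
        grad_b1 = d_hidden.mean(axis=0)
        self.W1 -= lr * grad_W1
        self.b1 -= lr * grad_b1
        self.W2 -= lr * grad_W2
        self.b2 -= lr * grad_b2
        return float((err ** 2).mean())


# Synthetic stand-ins: 8-dimensional "AFs" and 5-dimensional "acoustic features".
rng = np.random.default_rng(0)
target_af = rng.normal(size=(200, 8))      # target speaker's AF frames (toy data)
toy_channel = rng.normal(size=(8, 5))      # hypothetical "speech production" map
C_train = normalize_trajectories(target_af)
S_train = np.tanh(C_train @ toy_channel)   # target speaker's surface form (toy data)

# Training uses target-speaker data only; no source-speaker data is involved.
net = TargetSpeakerNet(n_in=8, n_hidden=16, n_out=5)
losses = [net.train_step(C_train, S_train) for _ in range(300)]

# Conversion: normalized AFs from an arbitrary, unseen source speaker go in,
# and target-like surface features come out.
source_af = rng.normal(loc=0.5, scale=2.0, size=(50, 8))
converted = net.forward(normalize_trajectories(source_af))
```

The point of the sketch is structural: because the network only ever sees the target speaker's (C, S) pairs, the source speaker contributes nothing at training time, which is what allows conversion from an arbitrary speaker as long as C is sufficiently speaker independent after normalization.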
Similar resources
Modeling a Noisy-channel for Voice Conversion Using Articulatory Features
In this paper, we propose modeling a noisy channel for the task of voice conversion (VC). We have used artificial neural networks (ANNs) to capture the speaker-specific characteristics of a target speaker, which avoids the need for any training utterance from a source speaker. We use articulatory features (AFs) as a canonical form or speaker-independent representation of a speech signal. Our studi...
Modelling a Noisy-channel for Voice Conversion Using Articulatory Features
In this paper, we propose modeling a noisy channel for the task of voice conversion (VC). We have used artificial neural networks (ANNs) to capture the speaker-specific characteristics of a target speaker, which avoids the need for any training utterance from a source speaker. We use articulatory features (AFs) as a canonical form or speaker-independent representation of a speech signal. Our studi...
Designing a New Non-Parallel Training Method for Voice Conversion with Better Performance than Parallel Training
Introduction: Voice mimicking by computers has been one of the most challenging topics in speech processing in recent years. A voice conversion system has two sides. On one side is the source speaker, whose voice is changed to mimic the voice of the target speaker on the other side. Two methods of p...
Speaker adaptation of an acoustic-articulatory inversion model using cascaded Gaussian mixture regressions
The article presents a method for adapting a GMM-based acoustic-articulatory inversion model trained on a reference speaker to another speaker. The goal is to estimate the articulatory trajectories in the geometrical space of a reference speaker from the speech audio signal of another speaker. This method is developed in the context of a system of visual biofeedback, aimed at pronunciation trai...
Speaker adaptation of an acoustic-to-articulatory inversion model using cascaded Gaussian mixture regressions
The article presents a method for adapting a GMM-based acoustic-articulatory inversion model trained on a reference speaker to another speaker. The goal is to estimate the articulatory trajectories in the geometrical space of a reference speaker from the speech audio signal of another speaker. This method is developed in the context of a system of visual biofeedback, aimed at pronunciation trai...
Using Articulatory Position Data to Improve Voice Transformation
Voice transformation (also known as voice conversion or voice morphing) is a name given to techniques which take speech from one speaker as input and attempt to produce speech that sounds like it came from another speaker. One compelling argument for good voice transformation is that it reduces the difficulty in creating additional synthetic voices with new identities and styles once an existin...